library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
set.seed(1)
Ydata <- runif(1000)
df <- tibble(y=Ydata)
for (i in 1:100){
  df[[paste0('x', i)]] <- runif(1000)  # 100 independent uniform predictors, unrelated to y
}
lm_fit <- lm(y ~ .,data = df)
smry <- summary(lm_fit)
i_count <- 0
for (i in 1:101){  # 101 coefficients: the intercept plus 100 slopes
  if (smry$coefficients[, 4][i] < 0.05){  # column 4 holds Pr(>|t|)
    i_count <- i_count + 1
  }
}
(i_count)
## [1] 6
From the above results, there exist some predictors whose marginal p-values fall below \(0.05\) even though every predictor is pure noise. Since the count changes with each simulated dataset, the current run gives \(6\) such coefficients, a number that could increase or decrease on another run.
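The counting loop above can be replaced by a single vectorised comparison. A minimal self-contained sketch (the data here are generated column-wise in one call rather than in a loop, so the exact count need not match the run above):

```r
set.seed(1)
# Simulate a 1000 x 101 data frame of independent uniforms:
# one response column and 100 noise predictors
sim <- as.data.frame(matrix(runif(1000 * 101), ncol = 101))
names(sim) <- c("y", paste0("x", 1:100))
smry <- summary(lm(y ~ ., data = sim))
# sum() over a logical vector replaces the explicit counter loop
n_sig <- sum(smry$coefficients[, "Pr(>|t|)"] < 0.05)
n_sig
```

By chance alone, roughly 5% of the 101 coefficients are expected to clear the 0.05 threshold.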
f <- smry$fstatistic
(smry)
##
## Call:
## lm(formula = y ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6043 -0.2257 -0.0055 0.2321 0.6903
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.903e-01 1.624e-01 3.634 0.000295 ***
## x1 -3.421e-02 3.272e-02 -1.046 0.296042
## x2 -3.559e-02 3.346e-02 -1.064 0.287722
## x3 2.026e-02 3.192e-02 0.635 0.525739
## x4 3.439e-02 3.264e-02 1.054 0.292339
## x5 8.913e-03 3.451e-02 0.258 0.796229
## x6 -2.417e-03 3.328e-02 -0.073 0.942125
## x7 -2.892e-03 3.300e-02 -0.088 0.930175
## x8 -5.956e-03 3.364e-02 -0.177 0.859503
## x9 -6.429e-03 3.221e-02 -0.200 0.841867
## x10 -8.359e-02 3.356e-02 -2.491 0.012931 *
## x11 -2.250e-02 3.359e-02 -0.670 0.503127
## x12 3.285e-02 3.264e-02 1.007 0.314343
## x13 -4.862e-02 3.306e-02 -1.470 0.141788
## x14 1.617e-02 3.407e-02 0.475 0.635140
## x15 3.012e-02 3.287e-02 0.916 0.359790
## x16 1.070e-02 3.353e-02 0.319 0.749609
## x17 -1.057e-03 3.331e-02 -0.032 0.974697
## x18 4.892e-02 3.346e-02 1.462 0.144109
## x19 -1.775e-02 3.338e-02 -0.532 0.595063
## x20 5.202e-03 3.371e-02 0.154 0.877391
## x21 -2.509e-02 3.360e-02 -0.747 0.455389
## x22 1.128e-02 3.247e-02 0.347 0.728327
## x23 1.241e-02 3.252e-02 0.382 0.702889
## x24 5.281e-03 3.403e-02 0.155 0.876726
## x25 -7.305e-02 3.365e-02 -2.171 0.030220 *
## x26 -5.417e-02 3.408e-02 -1.589 0.112305
## x27 -4.579e-02 3.371e-02 -1.359 0.174623
## x28 -4.443e-02 3.312e-02 -1.341 0.180147
## x29 1.407e-02 3.388e-02 0.415 0.677925
## x30 -9.658e-03 3.305e-02 -0.292 0.770140
## x31 7.948e-03 3.271e-02 0.243 0.808062
## x32 -1.185e-02 3.329e-02 -0.356 0.721916
## x33 -2.701e-02 3.381e-02 -0.799 0.424638
## x34 2.995e-02 3.291e-02 0.910 0.363060
## x35 -8.298e-03 3.394e-02 -0.244 0.806913
## x36 -3.681e-02 3.385e-02 -1.087 0.277172
## x37 2.545e-02 3.432e-02 0.742 0.458559
## x38 4.045e-02 3.401e-02 1.189 0.234641
## x39 -1.459e-02 3.320e-02 -0.439 0.660435
## x40 1.811e-02 3.355e-02 0.540 0.589509
## x41 3.742e-02 3.270e-02 1.145 0.252677
## x42 4.224e-02 3.344e-02 1.263 0.206867
## x43 1.692e-02 3.362e-02 0.503 0.614843
## x44 -3.306e-02 3.387e-02 -0.976 0.329308
## x45 -4.702e-02 3.351e-02 -1.403 0.160906
## x46 -4.877e-02 3.242e-02 -1.504 0.132870
## x47 2.837e-03 3.289e-02 0.086 0.931287
## x48 1.652e-02 3.373e-02 0.490 0.624510
## x49 1.244e-02 3.372e-02 0.369 0.712346
## x50 2.164e-02 3.307e-02 0.654 0.513049
## x51 -6.217e-02 3.259e-02 -1.908 0.056758 .
## x52 -2.343e-02 3.317e-02 -0.707 0.480031
## x53 -4.156e-02 3.379e-02 -1.230 0.219001
## x54 -9.515e-03 3.255e-02 -0.292 0.770111
## x55 3.855e-02 3.326e-02 1.159 0.246745
## x56 9.661e-03 3.284e-02 0.294 0.768696
## x57 -3.555e-02 3.241e-02 -1.097 0.273089
## x58 3.764e-02 3.194e-02 1.179 0.238907
## x59 -3.186e-02 3.271e-02 -0.974 0.330411
## x60 -6.571e-03 3.423e-02 -0.192 0.847795
## x61 -2.071e-02 3.296e-02 -0.628 0.529965
## x62 -4.984e-03 3.404e-02 -0.146 0.883643
## x63 5.211e-02 3.294e-02 1.582 0.113972
## x64 1.063e-02 3.251e-02 0.327 0.743698
## x65 -2.326e-02 3.303e-02 -0.704 0.481373
## x66 -1.772e-02 3.257e-02 -0.544 0.586491
## x67 -1.758e-03 3.329e-02 -0.053 0.957903
## x68 -7.429e-03 3.389e-02 -0.219 0.826521
## x69 5.449e-02 3.317e-02 1.643 0.100738
## x70 -2.354e-02 3.324e-02 -0.708 0.478981
## x71 2.314e-02 3.300e-02 0.701 0.483478
## x72 1.995e-02 3.314e-02 0.602 0.547460
## x73 3.226e-02 3.281e-02 0.983 0.325817
## x74 3.779e-02 3.341e-02 1.131 0.258356
## x75 4.316e-02 3.284e-02 1.314 0.189110
## x76 2.805e-02 3.290e-02 0.853 0.393981
## x77 3.638e-02 3.305e-02 1.101 0.271291
## x78 -4.685e-04 3.281e-02 -0.014 0.988611
## x79 -7.308e-03 3.313e-02 -0.221 0.825489
## x80 6.049e-04 3.315e-02 0.018 0.985444
## x81 1.248e-02 3.315e-02 0.376 0.706730
## x82 -6.622e-02 3.339e-02 -1.983 0.047648 *
## x83 7.077e-02 3.319e-02 2.133 0.033227 *
## x84 5.766e-02 3.282e-02 1.757 0.079290 .
## x85 2.976e-03 3.269e-02 0.091 0.927473
## x86 -1.527e-02 3.256e-02 -0.469 0.639202
## x87 1.617e-03 3.394e-02 0.048 0.962015
## x88 3.140e-02 3.202e-02 0.981 0.327044
## x89 -2.132e-02 3.314e-02 -0.643 0.520165
## x90 1.877e-02 3.322e-02 0.565 0.572103
## x91 -1.183e-01 3.338e-02 -3.545 0.000413 ***
## x92 9.479e-05 3.379e-02 0.003 0.997762
## x93 3.530e-02 3.375e-02 1.046 0.295902
## x94 2.124e-02 3.331e-02 0.638 0.523883
## x95 -5.604e-02 3.299e-02 -1.698 0.089772 .
## x96 -1.605e-02 3.299e-02 -0.487 0.626715
## x97 -3.374e-02 3.235e-02 -1.043 0.297275
## x98 9.989e-03 3.252e-02 0.307 0.758765
## x99 3.839e-02 3.422e-02 1.122 0.262228
## x100 -5.713e-02 3.322e-02 -1.719 0.085875 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2884 on 899 degrees of freedom
## Multiple R-squared: 0.1002, Adjusted R-squared: 0.0001042
## F-statistic: 1.001 on 100 and 899 DF, p-value: 0.4814
(f_pval <- pf(f[1], f[2], f[3], lower.tail = FALSE))
## value
## 0.481358
The pf function as used above takes a quantile and the two degrees of freedom of the F distribution. Here f[1] is the F-statistic value, f[2] is the numerator degrees of freedom (the number of predictors, 100), and f[3] is the denominator (residual) degrees of freedom. With lower.tail = FALSE, pf returns the upper-tail probability of the F distribution, which is exactly the p-value of the overall F-test.
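As a sanity check on this use of pf, here is a small self-contained sketch using the built-in cars dataset: for a one-predictor model, the overall F-test p-value computed from summary()$fstatistic must coincide with the t-test p-value of the slope, since \(F = t^2\) in that case.

```r
fit <- lm(dist ~ speed, data = cars)   # built-in dataset, one predictor
f <- summary(fit)$fstatistic           # named vector: value, numdf, dendf
f_pval <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
# For simple linear regression, F = t^2, so the two p-values coincide
p_slope <- summary(fit)$coefficients["speed", "Pr(>|t|)"]
all.equal(unname(f_pval), p_slope)
```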
results <- tibble(total_count = numeric(1000), p_value = numeric(1000))
N <- 1000
for (ij in 1:N){
  Ydata <- runif(1000)
  df <- tibble(y = Ydata)
  for (i in 1:100){
    df[[paste0('x', i)]] <- runif(1000)
  }
  # fitting the linear model
  lm_fit <- lm(y ~ ., data = df)
  smry <- summary(lm_fit)
  # count of marginally significant coefficients (intercept included)
  i_count <- 0
  for (i in 1:101){
    if (smry$coefficients[, 4][i] < 0.05){
      i_count <- i_count + 1
    }
  }
  # p-value of the overall F-test
  f <- smry$fstatistic
  f_pval <- pf(f[1], f[2], f[3], lower.tail = FALSE)
  results$total_count[ij] <- i_count
  results$p_value[ij] <- f_pval
}
cnt <-
  results %>%
  filter(total_count >= 2) %>%
  summarise(n = n())
(cnt)/N
## n
## 1 0.979
cnt <-
  results %>%
  filter(p_value < 0.05) %>%
  summarise(n = n())
(cnt)/N
## n
## 1 0.047
library(ggplot2)
## Boston Data
library(ISLR2)
coeff_data <- tibble(SLR = numeric(12), MLR = numeric(12))
(lm_summary <- summary(lm(crim ~ zn, data = Boston)))
##
## Call:
## lm(formula = crim ~ zn, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.429 -4.222 -2.620 1.250 84.523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.45369 0.41722 10.675 < 2e-16 ***
## zn -0.07393 0.01609 -4.594 5.51e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.435 on 504 degrees of freedom
## Multiple R-squared: 0.04019, Adjusted R-squared: 0.03828
## F-statistic: 21.1 on 1 and 504 DF, p-value: 5.506e-06
coeff_data$SLR[1] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = zn, y = crim)) +
geom_point() + # scatter plot of data points
geom_smooth(method = "lm", color = 'red')
## `geom_smooth()` using formula = 'y ~ x'
From the above, only a small portion of the variation in the per capita crime rate
by town is explained by the proportion of residential land zoned
for lots over \(25,000\) sq.ft,
zn. This is reflected in the low values of the \(R^2\) and the adjusted-\(R^2\). The p-value for the zn
coefficient is nonetheless well below \(0.05\), so the coefficient is statistically significant. The graph
confirms this, showing a weak negative
relationship.
(lm_summary <- summary(lm(crim ~ indus, data = Boston)))
##
## Call:
## lm(formula = crim ~ indus, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.972 -2.698 -0.736 0.712 81.813
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.06374 0.66723 -3.093 0.00209 **
## indus 0.50978 0.05102 9.991 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.866 on 504 degrees of freedom
## Multiple R-squared: 0.1653, Adjusted R-squared: 0.1637
## F-statistic: 99.82 on 1 and 504 DF, p-value: < 2.2e-16
coeff_data$SLR[2] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = indus, y = crim)) +
geom_point() +
geom_smooth(method = "lm", color = 'red')
## `geom_smooth()` using formula = 'y ~ x'
From the above, only some of the variation in the per capita crime rate
by town is explained by the proportion of non-retail business acres
per town,
indus. This is reflected in the values of the
\(R^2\) and the adjusted-\(R^2\). They are somewhat higher, indicating that
indus accounts for at least some of the variation in crim, but
not much. The p-value for the
coefficient is well below \(0.05\), so the coefficient is highly significant, and a weak
positive relationship is visible in the graph.
(lm_summary <- summary(lm(crim ~ chas, data = Boston)))
##
## Call:
## lm(formula = crim ~ chas, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.738 -3.661 -3.435 0.018 85.232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7444 0.3961 9.453 <2e-16 ***
## chas -1.8928 1.5061 -1.257 0.209
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.597 on 504 degrees of freedom
## Multiple R-squared: 0.003124, Adjusted R-squared: 0.001146
## F-statistic: 1.579 on 1 and 504 DF, p-value: 0.2094
coeff_data$SLR[3] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = chas, y = crim)) +
geom_point() +
geom_smooth(method = "lm", color = 'red')
## `geom_smooth()` using formula = 'y ~ x'
Also from the above, the per capita crime
rate by town is not well explained by the Charles River dummy variable (= 1
if tract bounds river; 0 otherwise),
chas. The chas values are categorical, so a simple
linear regression line is a poor way to describe the
relationship between chas and crim, as the graph of the fit
makes clear: relating a continuous response to a binary
predictor amounts to comparing two group means rather than fitting a
trend.
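Since chas is a 0/1 dummy, regressing crim on it is algebraically the same as a pooled two-sample t-test comparing mean crime rates for tracts on and off the river. A short sketch, assuming the ISLR2 package is installed:

```r
library(ISLR2)
# Pooled (equal-variance) two-sample t-test: identical to the
# SLR slope test for a binary 0/1 predictor
t_res <- t.test(crim ~ factor(chas), data = Boston, var.equal = TRUE)
t_res$p.value   # matches the slope p-value of about 0.209 reported above
```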
(lm_summary <- summary(lm(crim ~ nox, data = Boston)))
##
## Call:
## lm(formula = crim ~ nox, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.371 -2.738 -0.974 0.559 81.728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.720 1.699 -8.073 5.08e-15 ***
## nox 31.249 2.999 10.419 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.81 on 504 degrees of freedom
## Multiple R-squared: 0.1772, Adjusted R-squared: 0.1756
## F-statistic: 108.6 on 1 and 504 DF, p-value: < 2.2e-16
coeff_data$SLR[4] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = nox, y = crim)) +
geom_point() + # scatter plot of data points
geom_smooth(method = "lm", color = 'red')
## `geom_smooth()` using formula = 'y ~ x'
The p-values for the estimated coefficients are very small, and the
nitrogen oxides concentration (parts per 10 million) explains a modest but
meaningful share of the variation in the per capita crime rate by town;
a clear positive relationship is visible in the model fit and in the
values of the \(R^2\) and adjusted-\(R^2\).
(lm_summary <- summary(lm(crim ~ rm, data = Boston)))
##
## Call:
## lm(formula = crim ~ rm, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.604 -3.952 -2.654 0.989 87.197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.482 3.365 6.088 2.27e-09 ***
## rm -2.684 0.532 -5.045 6.35e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.401 on 504 degrees of freedom
## Multiple R-squared: 0.04807, Adjusted R-squared: 0.04618
## F-statistic: 25.45 on 1 and 504 DF, p-value: 6.347e-07
coeff_data$SLR[5] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = rm, y = crim)) +
geom_point() + # scatter plot of data points
geom_smooth(method = "lm", color = 'red')
## `geom_smooth()` using formula = 'y ~ x'
Comparing the per capita crime rate by town to the average number of
rooms per dwelling, the graph shows most of the data
concentrated near the center, with a weak negative relationship
between the two variables. The p-values for the estimated
coefficients are very small, so the slope is significant, but the
\(R^2\) from the model fit indicates that little of the
variation in
crim is explained by the variation in
rm.
(lm_summary <- summary(lm(crim ~ age, data = Boston)))
##
## Call:
## lm(formula = crim ~ age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.789 -4.257 -1.230 1.527 82.849
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.77791 0.94398 -4.002 7.22e-05 ***
## age 0.10779 0.01274 8.463 2.85e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.057 on 504 degrees of freedom
## Multiple R-squared: 0.1244, Adjusted R-squared: 0.1227
## F-statistic: 71.62 on 1 and 504 DF, p-value: 2.855e-16
coeff_data$SLR[6] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = age, y = crim)) +
geom_point() +
geom_smooth(method = "lm", color = 'red')
## `geom_smooth()` using formula = 'y ~ x'
The p-values for the estimated coefficients are very small, and the values
of the \(R^2\) and adjusted \(R^2\) show that a modest share of
the variation in
crim is explained by the variation in
the proportion of owner-occupied units built prior to 1940. Moreover,
the graph of the fit indicates a positive relationship
between the two variables.
(lm_summary <- summary(lm(crim ~ dis, data = Boston)))
##
## Call:
## lm(formula = crim ~ dis, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.708 -4.134 -1.527 1.516 81.674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.4993 0.7304 13.006 <2e-16 ***
## dis -1.5509 0.1683 -9.213 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.965 on 504 degrees of freedom
## Multiple R-squared: 0.1441, Adjusted R-squared: 0.1425
## F-statistic: 84.89 on 1 and 504 DF, p-value: < 2.2e-16
coeff_data$SLR[7] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = dis, y = crim)) +
geom_point() +
geom_smooth(method = "lm", color = 'red')
## `geom_smooth()` using formula = 'y ~ x'
There is a modest relationship between the per capita crime rate by
town and the weighted mean of distances to five Boston employment
centres, which the graph of the fit shows to be
negative. The estimated p-values for the
coefficients are very small, so the slope is significant, and the
values of \(R^2\) and adjusted \(R^2\) indicate that a modest share of the variation
in
crim is explained by the variation in
dis.
(lm_summary <- summary(lm(crim ~ rad, data = Boston)))
##
## Call:
## lm(formula = crim ~ rad, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.164 -1.381 -0.141 0.660 76.433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.28716 0.44348 -5.157 3.61e-07 ***
## rad 0.61791 0.03433 17.998 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.718 on 504 degrees of freedom
## Multiple R-squared: 0.3913, Adjusted R-squared: 0.39
## F-statistic: 323.9 on 1 and 504 DF, p-value: < 2.2e-16
coeff_data$SLR[8] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = rad, y = crim)) +
geom_point() +
geom_smooth(method = "lm", color = 'red')
## `geom_smooth()` using formula = 'y ~ x'
Although there is a positive relationship between the per capita
crime rate by town and the index of accessibility to radial highways, the
data in the graph of the fit are clustered in vertical
bands because rad takes only a few distinct values. A simple linear
model built on such a predictor would not predict
crim well despite its comparatively high \(R^2\).
(lm_summary <- summary(lm(crim ~ tax, data = Boston)))
##
## Call:
## lm(formula = crim ~ tax, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.513 -2.738 -0.194 1.065 77.696
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.528369 0.815809 -10.45 <2e-16 ***
## tax 0.029742 0.001847 16.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.997 on 504 degrees of freedom
## Multiple R-squared: 0.3396, Adjusted R-squared: 0.3383
## F-statistic: 259.2 on 1 and 504 DF, p-value: < 2.2e-16
coeff_data$SLR[9] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = tax, y = crim)) +
geom_point() +
geom_smooth(method = "lm", color = 'red')
## `geom_smooth()` using formula = 'y ~ x'
Even with a positive relationship between the per capita crime rate by town and the full-value property-tax rate per \(\$10,000\), and highly significant p-values for the estimated coefficients, the \(R^2\) and adjusted-\(R^2\) values are only moderate. The model captures only part of the data in the graph, so the variation in crim is only partially explained by the variation in tax.
(lm_summary <- summary(lm(crim ~ ptratio, data = Boston)))
##
## Call:
## lm(formula = crim ~ ptratio, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.654 -3.985 -1.912 1.825 83.353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.6469 3.1473 -5.607 3.40e-08 ***
## ptratio 1.1520 0.1694 6.801 2.94e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.24 on 504 degrees of freedom
## Multiple R-squared: 0.08407, Adjusted R-squared: 0.08225
## F-statistic: 46.26 on 1 and 504 DF, p-value: 2.943e-11
coeff_data$SLR[10] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = ptratio, y = crim)) +
geom_point() +
geom_smooth(method = "lm", color = 'red')
## `geom_smooth()` using formula = 'y ~ x'
Similar to the results for tax, the fit between the per
capita crime rate by town and the pupil-teacher ratio by town is
positive, but with the majority of the data concentrated at the upper end
of ptratio, the variation in crim is not well accounted for by
ptratio, as the low \(R^2\) shows despite the significant p-values of the estimated
coefficients.
(lm_summary <- summary(lm(crim ~ lstat, data = Boston)))
##
## Call:
## lm(formula = crim ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.925 -2.822 -0.664 1.079 82.862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.33054 0.69376 -4.801 2.09e-06 ***
## lstat 0.54880 0.04776 11.491 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.664 on 504 degrees of freedom
## Multiple R-squared: 0.2076, Adjusted R-squared: 0.206
## F-statistic: 132 on 1 and 504 DF, p-value: < 2.2e-16
coeff_data$SLR[11] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = lstat, y = crim)) +
geom_point() +
geom_smooth(method = "lm", color = 'red')
## `geom_smooth()` using formula = 'y ~ x'
The data points in the graph of the per capita crime rate by town against the lower status of the population (percent) are fairly spread out, and a positive association is visible between the two variables. The p-values for the estimated coefficients are very small, and the \(R^2\), though modest, is large enough to support the conclusion that some of the variation in crim is accounted for by the variation in lstat.
(lm_summary <- summary(lm(crim ~ medv, data = Boston)))
##
## Call:
## lm(formula = crim ~ medv, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.071 -4.022 -2.343 1.298 80.957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.79654 0.93419 12.63 <2e-16 ***
## medv -0.36316 0.03839 -9.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.934 on 504 degrees of freedom
## Multiple R-squared: 0.1508, Adjusted R-squared: 0.1491
## F-statistic: 89.49 on 1 and 504 DF, p-value: < 2.2e-16
coeff_data$SLR[12] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = medv, y = crim)) +
geom_point() +
geom_smooth(method = "lm", color = 'red')
## `geom_smooth()` using formula = 'y ~ x'
Similar to the relation between crim and lstat, the relation between the
per capita crime rate by town and the median value of owner-occupied
homes in \(\$1000s\) is reasonably strong,
with an \(R^2\) large enough
to support the assertion that some variation in crim is
explained by variation in medv. The main difference is that the
relationship here is negative.
(fit_summary <- summary(lm(crim ~ ., data = Boston)))
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.534 -2.248 -0.348 1.087 73.923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.7783938 7.0818258 1.946 0.052271 .
## zn 0.0457100 0.0187903 2.433 0.015344 *
## indus -0.0583501 0.0836351 -0.698 0.485709
## chas -0.8253776 1.1833963 -0.697 0.485841
## nox -9.9575865 5.2898242 -1.882 0.060370 .
## rm 0.6289107 0.6070924 1.036 0.300738
## age -0.0008483 0.0179482 -0.047 0.962323
## dis -1.0122467 0.2824676 -3.584 0.000373 ***
## rad 0.6124653 0.0875358 6.997 8.59e-12 ***
## tax -0.0037756 0.0051723 -0.730 0.465757
## ptratio -0.3040728 0.1863598 -1.632 0.103393
## lstat 0.1388006 0.0757213 1.833 0.067398 .
## medv -0.2200564 0.0598240 -3.678 0.000261 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared: 0.4493, Adjusted R-squared: 0.4359
## F-statistic: 33.52 on 12 and 493 DF, p-value: < 2.2e-16
From the above results, for the predictors dis,
rad, and medv (and, at the \(0.05\) level, zn as well), we reject the null
hypothesis \(H_0 : \beta_j = 0\),
since their marginal p-values fall well below \(0.05\). Moreover, the joint model attains
higher values of the \(R^2\) and
adjusted-\(R^2\), an
improvement over the individual regressions considered
first.
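The significant predictors can also be pulled out programmatically rather than read off the table. A short sketch, assuming the same ISLR2 Boston data:

```r
library(ISLR2)
mlr <- summary(lm(crim ~ ., data = Boston))
# Marginal p-values for the 12 predictors (intercept dropped)
pv <- mlr$coefficients[-1, "Pr(>|t|)"]
names(pv)[pv < 0.05]   # "zn" "dis" "rad" "medv"
```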
coeff_data$MLR <- fit_summary$coefficients[,1][2:13]
(coeff_data)
## # A tibble: 12 × 2
## SLR MLR
## <dbl> <dbl>
## 1 -0.0739 0.0457
## 2 0.510 -0.0584
## 3 -1.89 -0.825
## 4 31.2 -9.96
## 5 -2.68 0.629
## 6 0.108 -0.000848
## 7 -1.55 -1.01
## 8 0.618 0.612
## 9 0.0297 -0.00378
## 10 1.15 -0.304
## 11 0.549 0.139
## 12 -0.363 -0.220
plot(coeff_data$SLR, coeff_data$MLR, xlab = "Simple Linear Regression Coefficients", ylab = "Multiple Linear Regression Coefficients", main = "Plot of SLR coefficients against MLR coefficients")
The graph above plots each predictor's coefficient from its simple
linear regression against its coefficient from the multiple
linear regression. Clearly, the coefficients of some predictors
change substantially, even in sign, once the other predictors are
included. The multiple regression, which estimates each effect while
holding the remaining predictors fixed, gives a better account of the data
than the individual fits.
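The side-by-side coefficient table can also be rebuilt directly from the data instead of being filled in slot by slot. A sketch using reformulate() to build each univariate formula:

```r
library(ISLR2)
preds <- setdiff(names(Boston), "crim")
# Univariate slope for each predictor: crim ~ predictor
slr <- sapply(preds, function(p) {
  unname(coef(lm(reformulate(p, response = "crim"), data = Boston))[2])
})
# Multiple-regression slopes, intercept dropped
mlr <- coef(lm(crim ~ ., data = Boston))[-1]
round(cbind(SLR = slr, MLR = mlr[preds]), 3)
```

The nox row shows the most dramatic change: a large positive univariate slope (about +31) becomes negative (about -10) once the other predictors are held fixed.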
#For each of the predictors, fit a polynomial of degree 3....
#summary(lm(Boston$crim ~ poly(Boston$zn, 3)))
#Run a for loop and record the p-value for each ... compare the p-values and determine which is significant...
polynomial_pvals <- tibble(names = rep(NA_character_, 12), pval = numeric(12))
for (i in 2:ncol(Boston)) {
  predictor <- names(Boston)[i]
  # Ensure that the predictor is numeric for the poly() function
  if (is.numeric(Boston[[predictor]])) {
    formula <- as.formula(paste("crim ~ poly(", predictor, ", 3, raw = TRUE)"))
    lm_poly <- lm(formula, data = Boston)
    smry_poly <- summary(lm_poly)
    # Store predictor name and its corresponding p-value
    polynomial_pvals$names[i - 1] <- predictor
    # Extract the F-statistic and its p-value
    f <- smry_poly$fstatistic
    if (all(!is.na(f))) {
      f_pval <- pf(f[1], f[2], f[3], lower.tail = FALSE)
      polynomial_pvals$pval[i - 1] <- f_pval
    } else {
      polynomial_pvals$pval[i - 1] <- NA
    }
  } else {
    polynomial_pvals$pval[i - 1] <- NA
  }
}
(polynomial_pvals)
## # A tibble: 12 × 2
## names pval
## <chr> <dbl>
## 1 zn 1.28e- 6
## 2 indus 1.55e-32
## 3 chas 2.09e- 1
## 4 nox 3.81e-38
## 5 rm 1.07e- 7
## 6 age 1.02e-20
## 7 dis 3.14e-35
## 8 rad 2.31e-55
## 9 tax 7.34e-50
## 10 ptratio 4.17e-13
## 11 lstat 1.35e-26
## 12 medv 4.45e-59
After fitting a polynomial of degree 3 for each predictor against the response, the recorded p-values are infinitesimally small, and hence highly significant, for every predictor except the Charles River dummy variable (= 1 if tract bounds river; 0 otherwise), which is categorical. These are overall F-test p-values for the cubic fits, so they indicate strong association in general; examining the quadratic and cubic terms individually would show evidence of a non-linear association between crim and most of the predictors.
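The loop above records the overall F-test of each cubic fit, which measures association in general. To test non-linearity specifically, one can compare nested models, as in this sketch for medv (an illustrative choice of predictor):

```r
library(ISLR2)
fit_lin <- lm(crim ~ medv, data = Boston)           # linear term only
fit_cub <- lm(crim ~ poly(medv, 3), data = Boston)  # adds quadratic and cubic terms
# A small p-value means the non-linear terms significantly improve the fit
anova(fit_lin, fit_cub)
```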
f_2_9 <- function(x) {
  -4.5 * ((x - 10) / 50)^3 + 9 * ((x - 10) / 50)^2 + 4.2
}
x_range_tibble <- tibble(x = c(0, 100))
set.seed(1)
n_train <- 45
n_test <- 15
var_epsilon <- 2
test_data <-
  tibble(
    x = runif(n = n_test, min = 0, max = 100),
    true_f_x = f_2_9(x),
    y = true_f_x + rnorm(n = n_test, mean = 0, sd = sqrt(var_epsilon)),
    type = "te"
  )
train_data <-
  tibble(
    x = runif(n = n_train, min = 0, max = 100),
    true_f_x = f_2_9(x),
    y = true_f_x + rnorm(n = n_train, mean = 0, sd = sqrt(var_epsilon)),
    type = "tr"
  )
df_seq <- seq(1, 10, 1)
MSEP <- tibble(df = rep(NA, length(df_seq)), train = NA, test = NA)
for (i in 1:length(df_seq)) {
  MSEP$df[i] <- df_seq[i]
  smsp_2_9 <- lm(y ~ poly(x, df_seq[i], raw = TRUE), data = train_data)
  pred_train <- predict(smsp_2_9, newdata = train_data)
  MSEP$train[i] <- mean((train_data$y - pred_train)^2)
  pred_test <- predict(smsp_2_9, newdata = test_data)
  MSEP$test[i] <- mean((test_data$y - pred_test)^2)
}
MSEP_long <- pivot_longer(MSEP, cols = c(train, test), names_to = "Subset", values_to = "MSEP")
g_MSEP0 <-
  ggplot(MSEP_long) +
  geom_line(aes(x = df, y = MSEP, color = Subset)) +
  theme_bw()
min_test_df <-
  MSEP %>%
  filter(test == min(test)) %>%
  pull(df)
(ng <-
  g_MSEP0 +
  xlab("Flexibility") +
  ylab("Mean Squared Error") +
  geom_point(data = MSEP_long %>% filter(df == min_test_df), aes(x = df, y = MSEP), color = "black", shape = "X", size = 3))
ggplotly(ng)
From the above plots, it is evident that the training MSEP decreases as flexibility increases, because higher-degree polynomials fit the training data ever more closely. The testing MSEP, by contrast, initially decreases as flexibility increases, reaches a minimum at the optimal degree, and then increases due to overfitting. The best bias-variance trade-off occurs where the testing MSEP is minimized, marked in the graph; beyond that point, raising the polynomial degree (degrees of freedom) further only overfits the training data.
optimal_model <- lm(y ~ poly(x, min_test_df, raw = TRUE), data = train_data)
plot_data <- tibble(x = seq(0, 100, length.out = 1000))
plot_data <- plot_data %>%
  mutate(
    true_f_x = f_2_9(x),
    predicted_f_x = predict(optimal_model, newdata = plot_data)
  )
ggplot() +
  geom_line(data = plot_data, aes(x = x, y = true_f_x, color = "True Curve"), linewidth = 1) +
  geom_line(data = plot_data, aes(x = x, y = predicted_f_x, color = "Least MSE Polynomial"), linewidth = 1) +
  geom_point(data = train_data, aes(x = x, y = y, color = "Training Data"), alpha = 0.5) +
  geom_point(data = test_data, aes(x = x, y = y, color = "Testing Data"), alpha = 0.5) +
  theme_bw() +
  xlab("x") +
  ylab("y") +
  ggtitle("True Function vs. Estimated Function") +
  scale_color_manual(values = c("True Curve" = "black", "Least MSE Polynomial" = "orange", "Training Data" = "red", "Testing Data" = "green")) +
  theme(legend.title = element_blank()) # removes the legend title
The graph above shows the training and test data together with the true curve given in the problem statement and the estimated polynomial regression with the lowest test MSEP. Although the estimated polynomial does not match the true curve exactly, the resemblance is close, and the deviations can be attributed to error in estimating the model coefficients. The estimated regression function with the lowest MSEP on the testing set is a polynomial of degree \(k\) (here \(k = 3\), the value of min_test_df on this run). The function can be written as:
\[ \hat{f}(x) = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2 + \dots + \hat{\beta}_k x^k \]
where \(\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k\) are the coefficients obtained from fitting the polynomial regression model to the training data. These coefficients can be extracted in R using:
coefficients(optimal_model)
## (Intercept) poly(x, min_test_df, raw = TRUE)1
## 6.319791e+00 -1.991694e-01
## poly(x, min_test_df, raw = TRUE)2 poly(x, min_test_df, raw = TRUE)3
## 6.765512e-03 -4.634246e-05
From the results the coefficients are: \(\hat{\beta_0}=6.319791\), \(\hat{\beta_1}=-0.1991694\), \(\hat{\beta_2}=0.006765512\), and \(\hat{\beta_3}=-0.00004634246\). Thus, the estimated regression function is \[\hat{f}(x)=6.319791-0.1991694x+0.006765512x^{2}-0.00004634246x^{3}\]
The true function is given by \[f(x)=-4.5\bigg(\dfrac{x-10}{50}\bigg)^{3}+9\bigg(\dfrac{x-10}{50}\bigg)^{2}+4.2\] which expands to \[f(x)=-0.000036x^{3}+0.00468x^{2}-0.0828x+4.596.\] The estimated regression function with the lowest MSEP, \(\hat{f}(x)\), is a cubic polynomial, matching the form of the true function \(f(x)\); this can be checked by expanding the true polynomial (e.g. with Wolfram Alpha) and comparing coefficients term by term. The model has effectively captured the underlying relationship, as the estimated coefficients are close to the true ones. The small deviations in the intercept and linear term are likely due to noise in the training data. Overall, the estimated function gives a decent approximation of the true function and generalizes well to the testing data.
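The hand expansion of the true cubic can be verified numerically. A quick sketch comparing the two forms on a grid:

```r
# The true function as given, and its expanded standard-form polynomial
f_true <- function(x) -4.5 * ((x - 10) / 50)^3 + 9 * ((x - 10) / 50)^2 + 4.2
f_expanded <- function(x) -0.000036 * x^3 + 0.00468 * x^2 - 0.0828 * x + 4.596
xs <- seq(0, 100, by = 0.5)
all.equal(f_true(xs), f_expanded(xs))   # the two forms agree on the whole grid
```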
# Bias-Variance Trade-off -----
f_2_9 <- function(x) {
  -4.5 * ((x - 10) / 50)^3 + 9 * ((x - 10) / 50)^2 + 4.2
}
n_train <- 45
n_test <- 15
var_epsilon <- 2
set.seed(1)
# This is our single one and only test set:
test_data <-
  tibble(
    x = runif(n = n_test, min = 0, max = 100),
    true_f_x = f_2_9(x),
    y = true_f_x + rnorm(n_test, sd = sqrt(var_epsilon))
  )
df_seq <- seq(1, 10, 1)
# We'll repeat the simulation 100 times:
M <- 100
# Data frame for all results (we need to average over MSE / bias / var for each x over many datasets,
# so we need to store all the predictions as a first step, then compute point-wise MSE / bias / var
# and average over those)
# Important difference between this and the previous simulation:
# Here, we need to compute the variance and bias for each point. So
# we can't just sample a training dataset, compute a statistic (MSE)
# save it, and repeat. We need to save ALL predictions from all training
# datasets, and post-hoc compute the variance for each point across all datasets.
predictions <- NULL
for (m in 1:M) {
if (m%%5 == 0) print(m)
# Sample a training dataset: (many times)
train_data <-
tibble(
x = runif(n = n_train, min = 0, max = 100),
true_f_x = f_2_9(x),
y = true_f_x + rnorm(n_train, sd = sqrt(var_epsilon)),
type = "tr"
)
# For each dataset, compute avg. bias, avg. variance and test MSE for each level of flexibility:
for (i in 1:length(df_seq)) {
smsp_2_9 <- lm( y ~ poly(x,df_seq[i],raw=TRUE), data = train_data)
pred_test <-
as_tibble(predict(smsp_2_9, newdata = test_data)) %>%
rename(pred_f_x = value) %>%
mutate(
true_f_x = test_data$true_f_x,
true_y = test_data$y,
test_obs_idx = factor(row_number()),
m = m,
df = df_seq[i]
)
predictions <- bind_rows(predictions, pred_test)
}
}
## [1] 5
## [1] 10
## [1] 15
## [1] 20
## [1] 25
## [1] 30
## [1] 35
## [1] 40
## [1] 45
## [1] 50
## [1] 55
## [1] 60
## [1] 65
## [1] 70
## [1] 75
## [1] 80
## [1] 85
## [1] 90
## [1] 95
## [1] 100
(predictions)
## # A tibble: 15,000 × 6
## pred_f_x true_f_x true_y test_obs_idx m df
## <dbl> <dbl> <dbl> <fct> <int> <dbl>
## 1 5.66 5.02 5.01 1 1 1
## 2 6.41 6.14 9.54 2 1 1
## 3 7.82 8.44 9.52 3 1 1
## 4 10.2 8.71 7.58 4 1 1
## 5 5.22 4.53 2.91 5 1 1
## 6 10.1 8.83 8.42 6 1 1
## 7 10.4 8.19 7.77 7 1 1
## 8 8.43 9.17 8.59 8 1 1
## 9 8.21 8.95 9.30 9 1 1
## 10 4.23 4.25 2.99 10 1 1
## # ℹ 14,990 more rows
bias_var_MSE <-
predictions %>% # What does this object contain?
group_by(test_obs_idx, df) %>% # What does this line do?
summarize(
pt_wise_bias_sq = (mean(pred_f_x - true_f_x)) ^ 2,
pt_wise_var = mean( (pred_f_x - mean(pred_f_x)) ^ 2 ),
pt_wise_MSEE = mean( (pred_f_x - true_f_x) ^ 2 ),
pt_wise_MSEP = mean( (pred_f_x - true_y) ^ 2 )
) %>%
# How do the above formulas (in conjunction with the group_by and summarize functions)
# relate to the mathematical formulas for Bias, Variance, MSEE and MSEP?
group_by(df) %>% # What does this line do?
summarize(
bias_sq = mean(pt_wise_bias_sq),
var = mean(pt_wise_var),
MSEE = mean(pt_wise_MSEE),
MSEP = mean(pt_wise_MSEP)
)
## `summarise()` has grouped output by 'test_obs_idx'. You can override using the
## `.groups` argument.
# What is the resulting object bias_var_MSE? What does it contain (rows, columns, values)?
# If we did this right, the summary statistic MSEP should equal the one we computed for Figure 2.9:
#all.equal(
# MSEP_summary %>% filter(Subset == "test") %>% pull(mean_MSEP),
# bias_var_MSE$MSEP
#)
#all.equal(bias_var_MSE$MSEE, "")
# Always true
#est_var_epsilon <- ""
est_var_epsilon <- mean(bias_var_MSE$MSEP - bias_var_MSE$MSEE)
#bias_var_MSE$MSEP - (bias_var_MSE$bias_sq + bias_var_MSE$var + est_var_epsilon)
# True only in the limit
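A small sketch, with assumed toy values, of why the comment above says "true only in the limit": for a single finite test set, MSEP \(-\) MSEE only fluctuates around \(\sigma^2_\epsilon\), matching it on average.

```r
# Sketch with assumed toy values: for fixed predictions p of a known f,
# MSEP - MSEE = mean(eps^2 - 2 * eps * (p - f)) varies around var(eps)
# for any one finite test set, and equals it only on average.
set.seed(2)
sigma2 <- 2
one_diff <- function() {
  f_x <- rep(4.2, 15)                        # true f at 15 test points (held constant for simplicity)
  p   <- f_x + rnorm(15, sd = 0.1)           # some fixed predictions near the truth
  y   <- f_x + rnorm(15, sd = sqrt(sigma2))  # noisy test responses
  mean((p - y)^2) - mean((p - f_x)^2)        # MSEP - MSEE
}
diffs <- replicate(5000, one_diff())
range(diffs)   # individual differences vary widely around 2
mean(diffs)    # but the average is close to sigma2 = 2
```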
g <-
bias_var_MSE %>%
pivot_longer(c(bias_sq, var, MSEE, MSEP)) %>%
ggplot() +
geom_line(aes(x = df, y = value, color = name)) +
geom_hline(yintercept = est_var_epsilon, lty = 2) +
theme_bw() +
xlab("Flexibility") +
ylab("Mean Squared Error") +
theme(legend.title = element_blank())
ggplotly(g)
- pred_f_x: the prediction \(\hat{f}(x)\) for each test observation, produced by a polynomial model fit to the training data and then evaluated on the test data.
- true_f_x: the true value of the underlying function \(f_{2.9}(x)\) for each test observation.
- true_y: the simulated response \(y\) (true \(f_{2.9}(x)\) plus noise) for each test observation.
- test_obs_idx: a unique identifier for each of the 15 test observations, so that for each \(m\) and each \(df\) there is one prediction per test observation.
- m: the simulation number out of the \(M\) total simulations run for each polynomial degree.
- df: the degree of the polynomial used for a given fit.
group_by(test_obs_idx, df) takes the predictions tibble and splits it into groups, first by test observation index and then by polynomial degree, so that the summary statistics computed next are calculated within each (test observation, degree) pair.
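A minimal sketch of this grouping logic on hypothetical toy data (the values below are made up for illustration):

```r
suppressMessages(library(dplyr))

# Hypothetical toy data: 2 test points x 2 degrees x 2 simulations
toy <- tibble(
  test_obs_idx = rep(1:2, each = 4),
  df           = rep(c(1, 2), times = 4),
  pred_f_x     = 1:8
)
toy %>%
  group_by(test_obs_idx, df) %>%   # one group per (test point, degree) pair
  summarize(mean_pred = mean(pred_f_x), .groups = "drop")
## 4 rows, one per group: mean_pred = 2, 3, 6, 7
```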
Beyond the group_by() is the summarize(), where we define the point-wise summary statistics. For each of the 15 test observations at each polynomial degree, we use all of the saved predictions across the \(M\) simulations to compute the squared bias, variance, MSEE, and MSEP at that point. Writing \(\hat{f}_m(x)\) for the prediction from the \(m\)-th training set and \(\bar{\hat{f}}(x)=\frac{1}{M}\sum_{m=1}^{M}\hat{f}_m(x)\) for the average prediction, the variables defined in summarize() correspond to: \[pt\_wise\_bias\_sq = \bigg[\dfrac{1}{M}\sum_{m=1}^{M}\big(\hat{f}_m(x)-f(x)\big)\bigg]^2,\] the squared bias at a test point for a given \(df\) across all \(m\); \[pt\_wise\_var = \dfrac{1}{M}\sum_{m=1}^{M}\big(\hat{f}_m(x)-\bar{\hat{f}}(x)\big)^2,\] the variance of the predictions at that point; and the two error measures \[pt\_wise\_MSEE = \dfrac{1}{M}\sum_{m=1}^{M}\big(\hat{f}_m(x)-f(x)\big)^2, \qquad pt\_wise\_MSEP = \dfrac{1}{M}\sum_{m=1}^{M}\big(\hat{f}_m(x)-y\big)^2.\]
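At a single test point, the point-wise MSEE decomposes exactly into squared bias plus variance, which is why the commented check against bias_sq + var is "always true". A sketch with assumed values:

```r
# Sketch with assumed values: M predictions p of one test point with true f.
set.seed(42)
f <- 5                       # true f(x) at one test point
p <- rnorm(100, mean = 5.3)  # M = 100 simulated predictions of that point
MSEE     <- mean((p - f)^2)
bias_sq  <- (mean(p) - f)^2
variance <- mean((p - mean(p))^2)   # note: divides by M, not M - 1
all.equal(MSEE, bias_sq + variance)
## [1] TRUE
```

The identity holds for any vector of predictions, not just this example, because \(f\) is constant within a point-wise group.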
The result is then summarized a second time, now grouped by df (the degree of the polynomial). At this stage each degree of freedom has 15 point-wise estimates, one per test observation; for each df we average these 15 values of the squared bias, variance, MSEE, and MSEP to obtain a single summary value per degree.
The resulting object bias_var_MSE is a tibble with 10 rows and 5 columns. Each row corresponds to a specific polynomial degree (the first column, df), and the remaining columns hold the squared bias, variance, MSEE, and MSEP averaged over the 15 test points for that degree.